The data for this Exploratory Data Analysis (EDA) comes from a previous study (Baghela 2022). RNA-Seq and clinical data were gathered from five hospitals
(“cohorts”), comprising 348 patients from four emergency rooms (ER)
and one intensive care unit (ICU), plus 44 controls. 82 samples were from the
ICU, and the rest were from the ER. The data is distributed over six
different datasets, two of which are of particular interest; the
other four mainly contain IDs of genes and gene products. As
mentioned, the data has been used in a previous study, in which the
researchers were focused on extracting differentially expressed genes
(DEGs) related to sepsis severity and mortality. Additionally, these
researchers established endotypes based on immune-related genes. They
identified five endotypes that were mechanistically and biologically
different. We aim to perform a similar experiment in this project,
but focusing entirely on mitochondria-related genes. Our approach
differs in that we extract only the genes in question, making our gene
set considerably smaller than that of the previous study, which used all
genes. When establishing mitochondria-driven endotypes, we plan on
extracting them with two essential variables (“class variables”):
sepsis_severity and mortality.
sepsis_severity consists of three levels and is
based on SOFA scores, namely ‘High’ (SOFA of 5 or above),
‘Intermediate’ (SOFA of at least 2 and below 5), and ‘Low’ (SOFA below
2). mortality is based on the outcome of a patient’s
admission, either ‘deceased’, ‘survived’, or ‘unknown’. We plan on using
these two variables to extract DEGs (more on that in
de_and_pathway_analysis.Rmd). The main focus of this
project is to determine whether mitochondria-related genes have any
“power” in predicting sepsis severity, not necessarily to come to a
grandiose conclusion. Rather, we want to discover whether
there is a way forward for more personalized medicine by focusing on
mitochondria-related genes.
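As a hedged sketch (the variable handling is assumed from the description above, not taken from the project code), the mapping from SOFA score to severity level could look like:

```r
# Hypothetical sketch: derive the three sepsis_severity levels from a SOFA
# score, following the cutoffs described above (>= 5 High, [2, 5)
# Intermediate, < 2 Low). Not part of the actual project code.
severity_from_sofa <- function(sofa) {
  ifelse(sofa >= 5, "High",
         ifelse(sofa >= 2, "Intermediate", "Low"))
}

severity_from_sofa(c(7, 3, 1))  # "High" "Intermediate" "Low"
```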
The data consists of the following datasets:
We will only give a quick overview of the metadata and the raw counts’ data features since these contain most of the essential data. Most of the data has been ‘cleaned’. However, since we are working with a subset of the raw counts dataset, we will probably discover a different distribution and different outliers. Additionally, the previous researchers identified outliers but did not have enough evidence to support their removal. Also, many NA values are still present in the dataset. Some of these are introduced by the differences between cohorts; for instance, some metadata applies entirely to ICU information, which introduces NAs for ER samples. Therefore, an in-depth data analysis remains crucial.
The raw count dataset has a relatively simple structure; column names
are sample IDs and identical to the feature
sample_identifier in the metadata. Additionally, Ensembl
IDs are present as row names. We will retrieve gene names from Ensembl
and use these as row names instead. Raw counts are not normalized and
require a more intensive approach than metadata.
We first start by looking at the metadata after importing all relevant libraries.
We will start by loading some libraries – not all, because of function
name conflicts (namely dplyr and biomaRt) –
and read all relevant datasets. The reordered function (from
helper_functions.R) ensures sample identifiers are
in the same order in the metadata and count data.
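The body of reordered lives in helper_functions.R and is not shown here; a minimal sketch of what it presumably does (an assumption, not the actual implementation) is:

```r
# Hypothetical sketch of reordered() from helper_functions.R: put the columns
# of the count data in the same order as sample_identifier in the metadata.
reordered <- function(meta, counts) {
  counts[, meta$sample_identifier]
}

# Toy example
meta   <- data.frame(sample_identifier = c("s2", "s1"))
counts <- data.frame(s1 = c(10, 20), s2 = c(30, 40))
colnames(reordered(meta, counts))  # "s2" "s1"
```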
library(affy)
library(caret)
library(corrplot)
library(dendextend)
library(glue)
library(kableExtra)
library(knitr)
library(reshape2)
library(tidyverse, warn.conflicts = F)
library(visdat)
# Import some essential helper functions
source('scripts/helper_functions.R')
mitocondria.genes <- read.table('data/source/GOCC_MITOCHONDRION.v2023.1.Hs.gmt') %>%
dplyr::select(-c(1, 2)) %>% # Remove first two columns
t() %>%
as.data.frame() %>%
dplyr::rename(gene_id = 1)
meta.data <- read.csv('data/source/GSE185263_meta_data_N392.csv')
raw.counts <- read.csv('data/source/GSE185263_raw_counts.csv') %>%
column_to_rownames('ensembl_gene_id') %>%
reordered(meta.data, .)
ids <- read.csv('data/source/UniqueID.csv', header = T, sep = ';')
gene.products <- read.table('data/source/Mitochondrion_GeneProduct.txt', sep = '\t') %>%
dplyr::rename(gene_id = 1, desc = 2, gene_product = 3)
sra <- read.table('data/source/SraRunTable.txt', sep = ',', header = T)
To give a quick overview, we use the dim
function on every dataset to show their dimensions. As
shown in Table 3, we have a total of 392 patients, and the mitochondria
gene set (1,669 genes) is much smaller than the raw counts dataset
(58,389 genes): a loss of roughly 97.14% of the genes. The metadata
contains all clinical information and therefore has many features. The
other datasets have relatively few features.
# Dimensions of all datasets
dims <- data.frame(
Meta = c(dim(meta.data)),
Raw.Counts = c(dim(raw.counts)),
Gene.Products = c(dim(gene.products)),
SRA = c(dim(sra))
)
row.names(dims) <- c("Instances", "Features")
# Take a peek
kable(dims, format = "html", booktabs = TRUE, caption = "Dimensions of datasets") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
|  | Meta | Raw.Counts | Gene.Products | SRA |
|---|---|---|---|---|
| Instances | 392 | 58389 | 1382 | 392 |
| Features | 268 | 392 | 3 | 34 |
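The quoted gene loss can be reproduced directly from the two gene counts:

```r
# Share of genes lost when restricting the 58,389 measured genes to the
# 1,669 mitochondria-related ones.
n_total <- 58389
n_mito  <- 1669
loss_pct <- round((1 - n_mito / n_total) * 100, 2)
loss_pct  # 97.14
```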
The meta-dataset contains all patients’ clinical information, such as
gender, age, and information regarding their ER/ICU visit(s). Since we
have a distinction between these samples and healthy controls, we have a
lot of missing values (NAs). In this section, we will highlight the
missing values and devise a plan to deal with them. That could mean
removing some instances or attributes, or filling values in. For
example, a healthy control could get a 0 inserted in an
attribute regarding ER records. Filling in missing values is done in
collaboration with experts. In addition, we look at the distribution of
numeric attributes, and we might need to normalize some.
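As an illustration of such a context-aware fill (using a hypothetical er_visits column, not a real feature from the metadata):

```r
# Hypothetical example: healthy controls never visited the ER, so their NA in
# an ER-related attribute can reasonably be read as 0; other NAs stay NA.
demo <- data.frame(
  patient_location = c("er", "healthy_control", "icu"),
  er_visits        = c(2, NA, NA)
)
demo$er_visits <- ifelse(
  demo$patient_location == "healthy_control" & is.na(demo$er_visits),
  0, demo$er_visits
)
demo$er_visits  # 2 0 NA (the ICU value stays genuinely unknown)
```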
We look closer at the data types and the number of NAs per feature.
The glimpse function performs this task. The metadata set
contains so many features that displaying them here would not give an
adequate picture. Therefore, a CSV is available in the
misc/ directory. The SRA dataset mainly contains information about
the RNA-Seq data, collection information, creation date, etc. We do not
need this information and will not touch this dataset again. ids contains
an identifier to NCBI, and gene.products contains the gene name and
its product. For the metadata set, we only rename one very long
feature.
glimpse(sra)
## Rows: 392
## Columns: 34
## $ Run <chr> "SRR16193822", "SRR16193823", "SRR16193825", …
## $ Age <int> 55, 22, 88, 47, 65, 75, 18, 57, 21, 57, 24, 1…
## $ Assay.Type <chr> "RNA-Seq", "RNA-Seq", "RNA-Seq", "RNA-Seq", "…
## $ AvgSpotLen <int> 100, 100, 100, 100, 100, 100, 100, 100, 100, …
## $ Bases <dbl> 2070022600, 1872169000, 1855266400, 107131190…
## $ BioProject <chr> "PRJNA768419", "PRJNA768419", "PRJNA768419", …
## $ BioSample <chr> "SAMN22043762", "SAMN22043763", "SAMN22043764…
## $ Bytes <dbl> 917046299, 784875882, 795907335, 457044135, 5…
## $ Center.Name <chr> "MICROBIOLOGY AND IMMUNOLOGY, UNIVERSITY OF B…
## $ collection_location <chr> "colombia", "colombia", "colombia", "colombia…
## $ collection_site <chr> "Emergency Room", "Emergency Room", "Emergenc…
## $ Consent <chr> "public", "public", "public", "public", "publ…
## $ DATASTORE.filetype <chr> "run.zq,sra,fastq", "sra,run.zq,fastq", "run.…
## $ DATASTORE.provider <chr> "s3,gs,ncbi", "gs,s3,ncbi", "ncbi,gs,s3", "nc…
## $ DATASTORE.region <chr> "s3.us-east-1,gs.US,ncbi.public", "ncbi.publi…
## $ disease_state <chr> "sepsis", "sepsis", "sepsis", "sepsis", "seps…
## $ Experiment <chr> "SRX12478539", "SRX12478538", "SRX12478537", …
## $ in_hospital_mortality <chr> "Survived", "Survived", "Died", "Survived", "…
## $ Instrument <chr> "Illumina HiSeq 2500", "Illumina HiSeq 2500",…
## $ Library.Name <chr> "GSM5609007", "GSM5609006", "GSM5609005", "GS…
## $ LibraryLayout <chr> "SINGLE", "SINGLE", "SINGLE", "SINGLE", "SING…
## $ LibrarySelection <chr> "cDNA", "cDNA", "cDNA", "cDNA", "cDNA", "cDNA…
## $ LibrarySource <chr> "TRANSCRIPTOMIC", "TRANSCRIPTOMIC", "TRANSCRI…
## $ Organism <chr> "Homo sapiens", "Homo sapiens", "Homo sapiens…
## $ Platform <chr> "ILLUMINA", "ILLUMINA", "ILLUMINA", "ILLUMINA…
## $ ReleaseDate <chr> "2022-01-10T00:00:00Z", "2022-01-10T00:00:00Z…
## $ create_date <chr> "2021-10-04T15:04:00Z", "2021-10-04T15:04:00Z…
## $ version <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Sample.Name <chr> "GSM5609007", "GSM5609006", "GSM5609005", "GS…
## $ sex <chr> "female", "female", "female", "male", "female…
## $ sofa_24h_post_admisssion <int> 2, 2, 4, 2, 0, 2, 2, 2, 0, 2, 0, 6, 3, 0, 5, …
## $ source_name <chr> "Whole Blood", "Whole Blood", "Whole Blood", …
## $ SRA.Study <chr> "SRP339949", "SRP339949", "SRP339949", "SRP33…
## $ Tissue <chr> "Whole Blood", "Whole Blood", "Whole Blood", …
glimpse(ids)
## Rows: 392
## Columns: 2
## $ LibraryName <chr> "GSM5608946", "GSM5608947", "GSM5608948", "GSM5608949", "G…
## $ SampleName <chr> "sepcol001", "sepcol002", "sepcol003", "sepcol004", "sepco…
glimpse(gene.products)
## Rows: 1,382
## Columns: 3
## $ gene_id <chr> "GCK", "UQCRQ", "RARS2", "COX10", "TIMM23", "BRAF", "AGTP…
## $ desc <chr> "Hexokinase-4", "Cytochrome b-c1 complex subunit 8", "Pro…
## $ gene_product <chr> "", "", "RARSL", "", "TIM23", "BRAF1|RAFB1", "CCP1|KIAA10…
meta.data <- meta.data %>%
rename(icu_Anti.COVID_last_date = icu_Anti.COVID.Tx...Immune.Modulating.Tx.data.censored.on.date.of....last.sample.sent.)
Now, we zoom in on the metadata and how some important features are
distributed over the ER and ICU cohorts. These features relate to SOFA
scores, lactate measurements (an important indicator of mitochondrial
dysfunction), white blood cell measurements, and other essentials.
We calculate the mean, standard error, and total availability (count and
percentage) for every column named in ess_vars.
table_1 holds this information for all cohorts, and the
function calculate_metrics computes these metrics
separately for ER and ICU for every essential column. These calculations
give us a better understanding of how the data is distributed. The
results are depicted in Tables 4 and 5.
ess_vars <- c("sepsis_severity", "mortality",
"age", "gender", "first_at_ed_sofa", "icu_sofa",
"worst_within_72_sofa", "worst_within_72_lactate",
"worst_within_72_neutrophil_count", "outcome_hospital_stay_days",
"patient_location", "icu_outcome_icu_stay_days",
"worst_within_72_total_cell_count", "icu_ANC")
table_1 <- meta.data %>%
select(all_of(ess_vars)) %>%
mutate(patient_location =
ifelse(patient_location == 'ward', 'icu', patient_location)) %>%
mutate(age = as.numeric(age),
first_at_ed_sofa = as.numeric(first_at_ed_sofa),
icu_sofa = as.numeric(icu_sofa),
worst_within_72_sofa = as.numeric(worst_within_72_sofa),
worst_within_72_lactate = as.numeric(worst_within_72_lactate),
worst_within_72_neutrophil_count =
as.numeric(worst_within_72_neutrophil_count),
outcome_hospital_stay_days =
as.numeric(outcome_hospital_stay_days)) %>%
mutate(gender = ifelse(gender == 0, 'male', 'female'))
icu <- table_1 %>%
filter(patient_location == 'icu') %>%
select(-patient_location)
er <- table_1 %>%
filter(patient_location == 'er') %>%
select(-patient_location)
calculate_metrics <- function(df) {
  # empty data frame to which we can append rows
  table_one <- data.frame(matrix(ncol = 5, nrow = 0))
  names(table_one) <- c("var", "total_av", "mean", "sd", "av")
  for (column in names(df)) {
    col <- df[[column]]
    if (is.numeric(col)) {
      mean_value <- round(mean(col, na.rm = TRUE), 2)
      # calculate sd and from that the standard error
      se_value <- round(sd(col, na.rm = TRUE) /
                          sqrt(sum(!is.na(col))), 2)
    } else {
      # for categorical features: no mean or se
      mean_value <- NA
      se_value <- NA
    }
    total_available <- round(sum(!is.na(col)) / length(col) * 100, 2)
    av <- sum(!is.na(col)) # total available
    new_row <- c(column, total_available, mean_value, se_value, av)
    # bind to "table_one"
    table_one <- rbind(table_one, new_row)
  }
  return(table_one)
}
er_calc <- calculate_metrics(er)
names(er_calc) <- c("var", "total_av", "mean", "se", "av")
icu_calc <- calculate_metrics(icu)
names(icu_calc) <- c("var", "total_av", "mean", "se", "av")
er_calc %>%
filter(av != 0) %>%
kable(format = 'html', table.attr = "class='table table-bordered table-hover'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| var | total_av | mean | se | av |
|---|---|---|---|---|
| sepsis_severity | 100 | NA | NA | 266 |
| mortality | 99.62 | NA | NA | 265 |
| age | 100 | 56.05 | 1.26 | 266 |
| gender | 100 | NA | NA | 266 |
| first_at_ed_sofa | 100 | 1.97 | 0.12 | 266 |
| worst_within_72_sofa | 100 | 1.31 | 0.14 | 266 |
| worst_within_72_lactate | 18.05 | 2.24 | 0.25 | 48 |
| worst_within_72_neutrophil_count | 56.02 | 6.59 | 0.41 | 149 |
| outcome_hospital_stay_days | 97.74 | 7.53 | 0.54 | 260 |
| worst_within_72_total_cell_count | 80.45 | 9.39 | 0.39 | 214 |
icu_calc %>%
filter(av != 0) %>% # remove columns with no available information
kable(format = 'html', table.attr = "class='table table-bordered table-hover'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| var | total_av | mean | se | av |
|---|---|---|---|---|
| sepsis_severity | 100 | NA | NA | 82 |
| age | 98.78 | 61.41 | 1.7 | 81 |
| gender | 98.78 | NA | NA | 81 |
| first_at_ed_sofa | 100 | 7.34 | 0.55 | 82 |
| icu_sofa | 100 | 7.34 | 0.55 | 82 |
| icu_outcome_icu_stay_days | 100 | 12.04 | 0.97 | 82 |
| icu_ANC | 98.78 | 9.48 | 0.67 | 81 |
In the next section, we look at the data distribution of the
metadata. We are interested in the percentage of missing values per
attribute and the data type of each attribute. For this, we use the
the library visdat. Since we have many attributes, we subdivide
them into four plots via a function that generates
a plot for every 67 attributes. We do this twice: once for data types and
once for NAs.
loop.vector <- c(1, 67, 134, 201)
missing.values.plot <- lapply(loop.vector, function(x) vis_dat(meta.data[x:(x+67)]) +
labs(title = sprintf("Data type per feature, feature %s-%s", x, x+67), x = "Feature") +
theme(axis.text.x = element_text(angle = 90, hjust = 0, size = 7.5)))
plot(missing.values.plot[[1]])
Data types and NAs for features 1-67. Data types include character (red), integer (green), logical (blue), numeric (purple), and NA (grey).
plot(missing.values.plot[[2]])
Data types and NAs for features 67-134. Data types include character (red), integer (green), logical (blue), numeric (purple), and NA (grey).
plot(missing.values.plot[[3]])
Data types and NAs for features 134-201. Data types include character (red), integer (green), logical (blue), numeric (purple), and NA (grey).
plot(missing.values.plot[[4]])
Data types and NAs for features 201-268. Data types include character (red), integer (green), logical (blue), numeric (purple), and NA (grey).
Many NAs are present in the metadata. Basic information such as age
and gender is fully filled in; however, cohort-specific features contain
a large share of the missing values. In addition, the metadata contains
a diverse mix of data types. Next, let us zoom in on the missing values per
feature via the function vis_miss. We will use the
loop.vector variable once more.
plots.missing.values <- lapply(loop.vector, function(x) vis_miss(meta.data[x:(x+67)], cluster = T) +
labs(title = sprintf("Missing values per feature, feature %s-%s", x, x+67), x = "Feature") +
theme(axis.text.x = element_text(angle = 90, size = 6.5)))
plot(plots.missing.values[[1]])
First set of features (1-68) displaying the NAs per feature. In addition, a percentage of total missing values is given per feature. 40.2% of observations were NAs and 59.8% were filled in.
plot(plots.missing.values[[2]])
Second set of features (67-134) displaying the NAs per feature. In addition, a percentage of total missing values is given per feature. 57.2% of observations were NAs and 42.8% were filled in.
plot(plots.missing.values[[3]])
Third set of features (134-201) displaying the NAs per feature. In addition, a percentage of total missing values is given per feature. 51.4% of observations were NAs and 48.6% were filled in.
plot(plots.missing.values[[4]])
Fourth set of features (201-268) displaying the NAs per feature. In addition, a percentage of total missing values is given per feature. 80.4% of observations were NAs and 19.6% were filled in.
As shown in the figures above, many missing values are present in group-specific features. For example, features about patients located in the ICU have no data for ER patients and healthy controls. Since we are dealing with clinical data, it is vital to understand the context behind these missing values. We talked to Nicole Erler, an expert in the field of biostatistics (Erler, n.d.). According to Erler, “just” filling in a zero is not the way to go either. Our data is too challenging to impute quickly; doing so properly would be a sufficient half-year project on its own and probably not helpful. We therefore leave the missing values as they are for now. We could return later and still handle some of the more significant features.
To highlight the number of missing values even better, we construct a
table with the percentage of missing values for every attribute, which
allows us to zoom in on the group-specific features. We do this by
counting all the NAs per patient_location (ER, ICU) as a
percentage of each group.
table.missing.values <- meta.data %>%
group_by(patient_location) %>%
summarize(across(everything(),
~round(sum(is.na(.))/n()*100, 2))) %>%
column_to_rownames('patient_location') %>%
t()
# only keep features that exhibit 100% NAs over all cohorts
columns_delete <- table.missing.values %>% as.data.frame() %>%
filter(across(everything(), ~ . == 100))
kable(columns_delete, format = "html", booktabs = TRUE,
caption = "Features that exhibit 100% NAs, across all groups") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| er | healthy_control | icu | ward | |
|---|---|---|---|---|
| cdna_quantity | 100 | 100 | 100 | 100 |
| sequencing_date | 100 | 100 | 100 | 100 |
| worst_within_72_heart_map_target | 100 | 100 | 100 | 100 |
| icu_paired_sample | 100 | 100 | 100 | 100 |
| icu_tech | 100 | 100 | 100 | 100 |
| icu_cdna_prep_date | 100 | 100 | 100 | 100 |
| icu_cdna_quantity | 100 | 100 | 100 | 100 |
| icu_sequencing_date | 100 | 100 | 100 | 100 |
| icu_column_kit | 100 | 100 | 100 | 100 |
| icu_notes_sequencing_prep | 100 | 100 | 100 | 100 |
| icu_hospital_name | 100 | 100 | 100 | 100 |
| icu_recruitment_date | 100 | 100 | 100 | 100 |
| icu_triage_time | 100 | 100 | 100 | 100 |
| icu_time_first_seen_by_ed_physician | 100 | 100 | 100 | 100 |
| icu_ctas_score | 100 | 100 | 100 | 100 |
meta.data <- meta.data %>%
select(-all_of(rownames(columns_delete))) %>%
mutate(patient_location =
ifelse(patient_location == 'healthy_control', 'hc', patient_location))
The table above shows that a few features are empty for all cohorts. We can safely delete these features.
We use the nearZeroVar function to delete some near-zero
variables. As we did for glimpse, the result of the metadata set is
available as CSV under misc/ (only for internal readers).
We see some samples that exhibit little variance. It is still too early
to remove them, but it is a valuable indicator. We will revisit these
samples in the section regarding the counts dataset.
#l <- nearZeroVar(meta.data, names=T) %>% write.csv("near_zero_meta.csv")
nearZeroVar(raw.counts, names=T)
## [1] "sepcol024" "sepcol027" "sepnet1347" "sepnet1351" "sepnet1385"
## [6] "sepnet1415" "sepwes015" "sepwes021" "sepwes026" "sepwes032"
## [11] "sepwes034" "sepwes044" "sepwes052" "sepwes058" "sepwes070"
## [16] "sepwes080" "sepwes103" "hcwes025" "hcwes027" "sepcv006T0"
nearZeroVar(sra, names=T)
## [1] "Assay.Type" "BioProject" "Center.Name" "Consent"
## [5] "Instrument" "LibraryLayout" "LibrarySelection" "LibrarySource"
## [9] "Organism" "Platform" "ReleaseDate" "version"
## [13] "source_name" "SRA.Study" "Tissue"
nearZeroVar(gene.products, names=T)
## character(0)
nearZeroVar(ids, names=T)
## character(0)
In this section, we look at the distribution of the meta-dataset.
Since the meta dataset contains many attributes of the
character type, we use a barplot to discover relationships
between various features. Our primary interest is visualizing against the
class variables, namely mortality,
sepsis_severity, endotype_name, and
sample_location. The first two are, of course,
the most essential to look at. To accomplish this, we made a
counter function that counts the number of instances in two
different features and summarizes them in a barplot. The
!!rlang::sym(x) construct can be confusing:
rlang::sym(x) takes the string x and converts
it into a symbol, after which !! unquotes it.
counter <- function(dataset, target, features) {
  agg <- dataset %>%
    select(all_of(features), !!sym(target)) %>%
    pivot_longer(cols = all_of(features),
                 names_to = "feature", values_to = 'valuation') %>%
    group_by(!!sym(target), feature, valuation) %>%
    summarize(count = n(), .groups = 'drop')
  ggplot(agg, aes(x = valuation, y = count, fill = !!sym(target))) +
    geom_col() +
    facet_wrap(~ feature, scales = "free_x") +
    labs(x = "Value", y = "Count",
         title = glue("Various features distribution based on {target}"))
}
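A minimal illustration of the !!sym() pattern on toy data (assuming dplyr is attached; not part of the actual analysis):

```r
library(dplyr)

toy <- data.frame(group = c("a", "a", "b"), val = 1:3)
x <- "group"  # a column name held as a string

# sym() turns the string into a symbol; !! unquotes it inside the dplyr verb
toy %>%
  group_by(!!rlang::sym(x)) %>%
  summarize(count = n(), .groups = "drop")
```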
In the figure below, we see that the Toronto cohort is indeed the ICU
cohort, and most samples are from the ER. Most people, regardless of
admission, left the hospital alive; only a small portion died, which
means the mortality variable may be limited in scope regarding
DEG extraction. This variable also contains a lot of NAs, and the entire
ICU cohort has no information on mortality. For
sepsis_severity, most High severity cases come from the ICU
cohort, which is also the smallest. This class imbalance may be
problematic, especially for the ER.
counter(meta.data, 'sample_location', c('sepsis_severity',
'patient_location',
'mortality'))
Distributions of sample_location with
mortality, patient_location, and
sepsis_severity. Sample location is an important variable
since it shows the differences between the cohorts.
The figure below takes the severity variable as the counter. We see that most samples fall in the age groups 51-60 through 71-80.
counter(meta.data, 'sepsis_severity',
c('age_group', 'patient_location', 'mortality', 'endotype_name'))
Distributions of sepsis_severity with
mortality, endotype_name (established in the
Baquir study), and patient_location. Severity is spread out over the age
groups, and the number of samples increases with older populations. In
contrast to the ER cohort, the ICU has the largest share of High-severity
cases in percentage.
In the figure below, age_group is used as the counter. Low severity is more pronounced in younger populations, and almost no one under the age of forty died of sepsis.
counter(meta.data, 'age_group', c('mortality', 'sepsis_severity'))
Distributions of age_group with mortality and
sepsis_severity. Low severity exhibits a higher share of
younger populations, whereas older populations are more abundant in High
and Intermediate subgroups. This phenomenon can also be seen in the
mortality variable; younger age groups are less represented in the
subcategory deceased.
Next, we mutate all the attributes that are supposed to be of the
numeric type to that type. We do this since we are
interested in the distribution of numeric values to determine whether we
need to normalize/standardize these attributes. Attributes
consisting of 0/1 will also be converted to TRUE and
FALSE, respectively. Thereafter, as a temporary
measure, we select all numeric values, fill any NA with 0 (until
they are filled in correctly later), and compare the distributions. We plot
20 features at a time because of the large number of features; there are
116 numeric features in total. We use the function
plot_distributions for this. This function splits the data
into indices specified by the user and shows the distribution of a
particular variable. We use patient_location to color in
the distribution. This function can, in practice, be used for all data
types.
meta.data <- meta.data %>%
mutate_if(~ all(. %in% c(0, 1, NA)), ~ as.logical(.))
numeric_data <- meta.data %>%
select_if(is.numeric)
plot_distribution <- function(dataset, col_length, color_var) {
  p <- list()
  # calculate how many parts we need to split the data in
  indices <- ceiling(ncol(dataset) / col_length)
  # which variable to distinguish upon (coloring in)
  color_var <- as.factor(color_var)
  for (i in 1:indices) {
    # calculate the start and end index
    start_col <- (i - 1) * col_length + 1
    end_col <- i * col_length
    if (end_col > ncol(dataset)) {
      end_col <- ncol(dataset)
    }
    p[[i]] <- dataset[, start_col:end_col] %>%
      # if we have character types, convert to factor and then to numeric
      mutate_if(is.character, as.factor) %>%
      mutate_if(is.factor, as.numeric) %>%
      bind_cols(color_var) %>%
      # put the color_column as the last column (it might not be in 'dataset' param)
      rename(color_column = last_col()) %>%
      # longer format since we want to count
      pivot_longer(-color_column,
                   names_to = "features", values_to = "values") %>%
      ggplot(aes(x = values, fill = color_column)) +
      geom_histogram(bins = 15) +
      # for every feature, make a "small" plot
      facet_wrap(~ features, scales = "free") +
      theme(strip.text = element_text(size = 7))
  }
  return(p)
}
without.nas <- plot_distribution(numeric_data, 20, meta.data$patient_location)
# turn the NAs into 0 for comparison
numeric_data <- numeric_data %>%
mutate_all(~ifelse(is.na(.), 0, .))
with.nas <- plot_distribution(numeric_data, 20, meta.data$patient_location)
By comparing the figures above, we see the influence of the NAs in action. As we highlighted, the number of NA values is cohort-dependent and needs special attention when using specific features (later established as significant). However, we can still use most of the features depicted here on a per-cohort basis; for example, we make separate datasets when analyzing the ER cohort.
In the next section, we define plot_corrrplot, which
makes a correlation plot based on parameters the user can specify. The idea
is to use all data for the correlation analysis, but to do it per
sample_location (Australia, Netherlands, etc.) to mitigate
the NA distribution somewhat. However, NAs are still present and cannot be
fully avoided. One disadvantage of this method is that some essential
variable-variable comparisons cannot be calculated due to missing
values. We only extract correlations above a certain threshold (0.7
by default, via the parameter sig_thres). group
simply refers to the group the function is applied to.
plot_corrrplot <- function(data, group = NULL, sig_thres = 0.7,
                           method = 'spearman',
                           use = 'everything', pre = T) {
  if (pre) {
    # only for datasets with these attributes
    data <- data %>% as.data.frame() %>%
      select(-c(sample_identifier, GEO_identifier, sample_identifier_raw))
  }
  correlation_df <- data %>% as.data.frame() %>%
    mutate_if(is.character, as.factor) %>% # from character to factor
    mutate_if(is.factor, as.numeric) %>%   # from factor to int
    cor(method = method, use = use) %>%
    as.data.frame() %>%
    # put the variable (rowname) into a column
    rownames_to_column('variable') %>%
    # longer format - compare each variable against all others
    pivot_longer(-variable, names_to = 'comp2', values_to = 'score') %>%
    drop_na() %>% # drop NAs and the diagonal
    filter(abs(score) > sig_thres & score != 1) %>%
    arrange(desc(abs(score))) %>%
    # data frame to matrix for the corrplot function
    acast(variable ~ comp2, value.var = "score")
  corrplot(correlation_df, method = 'ellipse', na.label = " ",
           tl.cex = 0.7, is.corr = F, type = 'lower',
           diag = F, main = glue("Correlation plot for {group}"))
}
corr_plots <- meta.data %>%
group_by(sample_location) %>%
group_walk(~ plot_corrrplot(.x, .y, 0.8))
It is hard to explain every correlation here, but from most figures
above we can conclude that patient_location correlates well
with various sepsis SOFA features (e.g.,
sepsis-3). Also, the class variable
sepsis_severity has high correlations with other SOFA
features (e.g., first_at_ed_sofa and
icu_sofa). We won’t be deleting any of these features.
As a final zoom-in on the metadata set, we look at the SOFA scores
per sample_location in the figure below. When calculating these
boxplots, we removed the healthy controls entirely and selected
four features highly correlated with
sepsis_severity. We can conclude that, of all the ER
cohorts, Australia has the highest overall SOFA scores (though not for
qSOFA). Also, as expected, the ICU cohort has the largest range of SOFA
scores (between roughly 2 and 15+).
meta.data %>%
filter(patient_location != 'hc') %>%
select(icu_sofa, first_at_ed_sofa, sample_location,
at_ed_qsofa, worst_within_72_sofa) %>%
pivot_longer(-sample_location) %>%
ggplot(aes(x = name, y = value, color = sample_location)) +
geom_boxplot() +
labs(x = 'SOFA metric', y = 'Value',
title = 'SOFA score boxplots per sample location')
Boxplots for every cohort in various SOFA-related variables. The ER cohort Australia exhibited the highest SOFA scores of all the ER cohorts.
In this section, we will first look at the entire raw counts dataset
and zoom in on the mitochondria-related genes later. It is essential to
first look for outliers and other anomalies here since this dataset
covers all of the variance, which also impacts the mitochondria-related
genes. First, we use DESeq2, a much-used library in a
standard DE analysis workflow. We use the ~1 design to not
specify any covariates - we are not comparing different conditions yet.
However, this allows us to use some of DESeq2’s tools.
We use the whole population here, including the healthy controls. We perform PCA to get a basic understanding of the gene distribution at a sample level.
library(corrr)
library(reshape2)
library(pheatmap)
library(PoiClaClu)
library(grid)
library(gridExtra)
library(DESeq2)
library(factoextra)
( ddsMat <- DESeqDataSetFromMatrix(countData = raw.counts,
colData = meta.data, design = ~ 1) )
## class: DESeqDataSet
## dim: 58389 392
## metadata(1): version
## assays(1): counts
## rownames(58389): ENSG00000000003 ENSG00000000005 ... ENSG00000285509
## ENSG00000285513
## rowData names(0):
## colnames(392): sepcol001 sepcol002 ... hchlD218 hchlD219
## colData names(253): sample_identifier GEO_identifier ... mortality
## day_to_death
# normalize with VST
rld.dds.hc <- vst(ddsMat)
# scale=F due to low-expressed genes
pca_hc <- perform_pca(assay(rld.dds.hc), scale=F)
prepare_pca <- function(meta, pca_df) {
  pca_scores <- pca_df$pca$x %>%
    as_tibble(rownames = 'sample_identifier') %>%
    # merge with metadata
    full_join(x = ., y = meta, by = 'sample_identifier') %>%
    # set mortality to 'unknown' if NA
    mutate(mortality = ifelse(is.na(mortality), 'unknown', mortality))
  # give the variance in percentage its own column (for plot_pca specifically)
  pca_scores <- pca_scores %>%
    mutate(var = pca_df$eigenvalues$pva)
  return(pca_scores)
}
plot_pca <- function(dataset, x_var = 'PC1', y_var = 'PC2', fill = NULL,
                     color = NULL, size = NULL, shape = NULL, meta = NULL) {
  if (!is.null(meta)) {
    # prepare a dataset
    dataset <- prepare_pca(meta, dataset)
  }
  var <- dataset$var
  ggplot(dataset, aes_string(x = x_var, y = y_var,
                             color = color, fill = fill,
                             size = size, shape = shape)) +
    geom_point() +
    labs(title = glue("PCA regarding {color}"),
         x = glue("PC1 ({round(var[1], 2)}%)"),
         y = glue("PC2 ({round(var[2], 2)}%)")) +
    theme(plot.title = element_text(size = 10))
}
In the figure below, we see that the healthy controls form their own cluster. Samples from different cohorts also differ slightly, even among the healthy controls (Netherlands and Australia). The ER cohort has a large range, but based on sample location, there might be some bias here; we will look further into it. The ICU has a smaller range but overlaps with the ER considerably.
plot_pca(pca_hc, color='patient_location', shape='sample_location', meta = meta.data)
PCA plot based on sample location, with the healthy controls included. As we can see, the healthy controls form their own cluster, whereas the ER, ward, and ICU cohorts are much more spread out. The biggest cohort, ER, has a lot of overlap between its sample locations, and the ICU also overlaps with the ER. In addition, samples from the same location cluster together, stressing the need for batch correction.
We will now filter out the healthy controls and focus entirely on septic patients from now on.
meta.sepsis <- meta.data %>%
  filter(!patient_location %in% c('healthy_control', 'hc'))
sepsis.counts <- raw.counts %>%
  select(-starts_with('hc')) %>%
  reordered(meta.sepsis, .)
# the outer parentheses print a summary of the object
( ddsMat <- DESeqDataSetFromMatrix(countData = sepsis.counts,
colData = meta.sepsis, design = ~ 1) )
## class: DESeqDataSet
## dim: 58389 348
## metadata(1): version
## assays(1): counts
## rownames(58389): ENSG00000000003 ENSG00000000005 ... ENSG00000285509
## ENSG00000285513
## rowData names(0):
## colnames(348): sepcol001 sepcol002 ... sepcv702W1 sepcv703W1
## colData names(253): sample_identifier GEO_identifier ... mortality
## day_to_death
In addition to PCA, we will plot the density of the sepsis-related
count data in its unnormalized form and under three different
standardization and normalization efforts, namely min-max, log2, and
VST. The latter is designed explicitly for RNA-Seq data: it stabilizes
the variance across the range of mean values in count data. We use
blind=TRUE (the default) since we do not consider
differences between samples yet. assay() is used to extract
the normalized counts. The figures below cover the normalization
techniques. We plot with plotDensity() from the library
affy.
min.max.normalization <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
counts.log <- log2(sepsis.counts + 0.01) # + 0.01 avoids taking log2 of zero
counts.norm <- min.max.normalization(sepsis.counts)
rld.dds <- vst(ddsMat)
rld <- assay(rld.dds)
plotDensity(sepsis.counts, xlab = 'Unnormalized counts',
main = 'Expression distribution (unnormalized)')
Distribution of unnormalized RNA-Seq counts from septic patients.
The figure above confirms the need for normalization; the raw counts are extremely skewed, with a long right tail.
plotDensity(counts.log, lty=c(1:ncol(counts.log)),
xlab = 'Log2(sepsis.counts)',
main = 'Expression distribution (log2 normalized)')
Density plot on log-normalized counts.
The log2 transformation already improves the distribution considerably.
plotDensity(counts.norm, lty=c(1:ncol(counts.norm)),
xlab = 'sepsis.counts',
main = 'Expression distribution (min-max standardized)')
Density plot on min-max-standardized counts.
This looks much like the unnormalized distribution: min-max standardization preserves the skew and remains sensitive to outliers.
plotDensity(rld, lty=c(1:ncol(rld)), xlab = 'sepsis.counts',
main = 'Expression distribution (VST standardized)')
Density plot on VST normalized counts.
VST is one of the preferred normalization methods for RNA-Seq data and is useful for downstream analysis. The distribution range is still large, but that is normal for RNA-Seq. We therefore use this normalization technique for the rest of the project.
Next, we experimented with different parameters to filter out
low-expressed genes. Typically, counts below ten, observed in too few
samples, are considered uninformative. We tried count thresholds of
1-15 (in low.exp), minimum sample counts of 1-10 (in
sample), and whether to normalize the counts before
filtering (in norm), covering 300 different parameter
combinations in total. Unfortunately, there are too many to show in a
table, and the resulting gene-set sizes vary greatly. We settled on
requiring a count of at least 10 in at least ten samples. We did not
normalize before filtering, as this resulted in too many lost genes;
it is also standard practice not to normalize before filtering out
low-expressed genes. After that, we compared the gene sets with and
without the low-expressed genes. After filtering, the variance
explained increased slightly for both PC1 and PC2, which is a good
sign.
# to get estimateSizeFactors to use the `normalized` parameter
dds <- DESeq(ddsMat)
different_params <- data.frame(outcome = numeric(), used_comb = character())
combinations <- crossing(low.exp = 1:15, sample = 1:10, norm = c(TRUE, FALSE))
for (i in 1:nrow(combinations)) {
  combination <- combinations[i, ]
  # use normalized counts only when this combination asks for it
  keep <- rowSums(counts(dds, normalized=combination$norm) >= combination$low.exp) >= combination$sample
  outcome <- ddsMat[keep, ] %>% dim() # only retain the number of genes kept
  # make a reference to this combination
  used_comb <- glue("{combination[1]}-{combination[2]}-{combination[3]}")
  different_params <- bind_rows(different_params,
                                data.frame(outcome = outcome[1], used_comb = used_comb))
}
different_params <- different_params %>%
  separate(used_comb, into = c("low_exp", "sample", "norm"), sep = "-")
head(different_params)
keep <- rowSums(counts(dds) >= 10) >= 10
dds <- dds[keep, ]
rld.two <- assay(vst(dds))
# with and without low-expressed genes
without_low <- perform_pca(rld, scale = F) %>%
  plot_pca(color = "patient_location", meta = meta.sepsis)
with_low <- perform_pca(rld.two, scale = F) %>%
  plot_pca(color = "patient_location", meta = meta.sepsis)
grid.arrange(without_low, with_low, ncol=2,
             top = "PCA plots: Influence of low-expressed genes")
PCA plot with (a) and without (b) low-expressed genes. Variance increases without the low-expressed genes.
In the table below, we can see how many genes we “lost”. We went from 58,389 to 17,825 genes with sufficient expression - a reduction of roughly 69%.
data.frame(
  before = dim(rld),
  after = dim(rld.two)
) %>%
  kable(format = 'html', booktabs = TRUE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | before | after |
|---|---|---|
| genes | 58389 | 17825 |
| samples | 348 | 348 |
# overwrite rld for simplicity
rld <- rld.two
We have used PCA before; we do this with functions from our own
helper_functions.R script. Note that, since we removed the
low-expressed genes, we can now safely use the
scale=TRUE parameter. We show it only once here; in the
rest of the project, it is standard usage. Scaling does, however, cost
some of the variance explained by the first PCs.
pca_res <- perform_pca(rld, scale = T)
The following three plots visualize the total variance explained per
PC. As we can see in the first figure, the first two PCs each contain
more than 15% of the total variance; after that, the variance
explained decreases to less than 5% per additional PC. The second
figure shows this cumulatively: 89 PCs are needed to explain 80% of
the total variance. We summarize both figures in a so-called Pareto
plot. It shows how the cumulative variance increases per PC via the
black dotted line, stopping at around 89 PCs, which describe 80% of
the total variance. Used as a dimension-reduction effort, this would
take us from >17,000 genes (features) to 89 features. With the
explain_variance function, we calculate the number of PCs
needed to explain 80% of the variance.
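The explain_variance() helper also lives in helper_functions.R and is not shown here. Assuming it takes the cumulative-variance vector and a threshold in percent, a minimal sketch could be:

```r
# Hypothetical sketch of explain_variance(); the actual function lives
# in helper_functions.R and may differ.
explain_variance <- function(cum_var, threshold) {
  # index of the first PC whose cumulative explained variance
  # (in percent) reaches the threshold
  which(cum_var >= threshold)[1]
}

explain_variance(c(30, 55, 72, 81, 90), 80)  # the 4th PC reaches 80%
```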
ycord <- explain_variance(pca_res$eigenvalues$p_cum, 80)
ggplot(pca_res$eigenvalues[1:10, ], aes(x = PC, y = pva)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Scree plot", x = "PC", y = "Variance in percentage")
Scree plot with the first 10 principal components
ggplot(pca_res$eigenvalues[1:ycord, ], aes(x = PC, y = p_cum)) +
  geom_col(color = "black") +
  geom_hline(yintercept = 80,
             linetype = 'dashed', color = 'red') +
  labs(title = "Variance explained by principal components",
       x = 'PC',
       y = "Cumulative percentage variance explained") +
  theme(axis.text.x = element_text(angle = 90, size = 7))
Barplot with cumulative variance explained in percentage per principal component. Red line indicates 80% of total variance explained.
# Pareto chart
ggplot(pca_res$eigenvalues[1:ycord, ], aes(x = PC)) +
  geom_col(aes(y = pva)) +
  geom_line(aes(y = p_cum, group = 1)) + # group = 1: treat all data as one series
  geom_point(aes(y = p_cum)) +
  geom_hline(yintercept = 80, linetype = 'dashed', color = 'red') +
  labs(x = "Principal Component", y = "Percentage variance explained") +
  theme(axis.text.x = element_text(angle = 45, size = 7))
Pareto chart of the first 89 principal components. Each principal component explains a certain percentage of the total variance. The cumulative percentage of variance explained is summarized in the dotted line, which stops when it reaches the red line, indicating 80% of total variance explained.
Next, we cluster on multiple class variables. The figure here is
most important, as we color by sepsis_severity and
mortality. It shows that the spread of all groups of
samples seems random; no distinct groups can be seen. This is
concerning, as no obvious grouping emerges from these class variables,
and a supervised machine learning algorithm may therefore struggle to
predict them accurately.
prep_pca <- prepare_pca(meta.sepsis, pca_res)
plot_pca(prep_pca, color = 'sepsis_severity', shape = 'mortality')
Clustering on sepsis_severity with groups High in red,
Intermediate in green, and Low in blue; and on mortality with
groups ‘unknown’ (former NA; a temporary measure), deceased as
triangles, and survived as squares. The clustered groups seem random,
and no major distinction can be made.
As observed before, in the metadata section, we expected some effect
from the various sample locations. Now, we will try to visualize this
effect through PCA (and also visualize other features). If we look at
the first subfigure in the figure below, we see that
sample_location clusters well. This is the batch effect -
clusters form because samples come from the same batch. We will
correct for this. To accomplish it, we wrote a function in
helper_functions.R that uses sva’s
ComBat function on the sequence_month_year
variable. This way, we have one function that both applies
DESeq2’s VST and performs batch correction.
In the next section, we perform the batch correction and thereafter
run PCA again. Looking at the same subfigure, we see that the
sample_location feature no longer clusters as
strongly.
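As a hedged sketch only - the real implementation lives in helper_functions.R - the normalize() helper described above could look roughly like this:

```r
# Hypothetical sketch of normalize(): VST followed by ComBat batch
# correction on the sequencing date; details may differ from the
# actual helper in helper_functions.R.
normalize <- function(counts, meta) {
  # variance-stabilize the raw counts first
  vst_counts <- DESeq2::vst(as.matrix(counts))
  # then remove the batch effect attributed to sequencing month/year
  sva::ComBat(dat = vst_counts, batch = meta$sequence_month_year)
}
```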
counts.batch <- normalize(assay(dds), meta.sepsis)
## Found 238 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
# perform pca again, now on batch-corrected counts
pca_res_batch <- perform_pca(counts.batch, scale = F)
prep_pca <- prepare_pca(meta.sepsis, pca_res_batch)
multiple_pcas <- lapply(c('sample_location', 'mortality',
                          'endotype_name', 'sepsis_severity'),
                        function(x) plot_pca(prep_pca, color = x))
grid.arrange(grobs = multiple_pcas,
             top = "PCA on multiple variables with batch correction")
Outliers are tough to identify just by looking at a PCA plot, since
our data does not cluster all that well, especially on class
variables. Therefore, we turned to sklearn’s
IsolationForest. This model works by isolating anomalies
based on their low frequency and distinctness. It handles
high-dimensional datasets well and is fast. Its typical use case lies
outside RNA-Seq, but it can be a good indicator of outliers. We will
not rely solely on this model to remove entire samples; it is just a
way of labeling potential outliers. We will still zoom in on these
so-called outliers and decide whether to remove them - RNA-Seq data
can be noisy and complex, and some outliers can be biologically
interesting. IsolationForest is a Python model and is
therefore not included in this notebook; please find it in
detect_outliers.ipynb.
The samples in possible_outliers have been identified by
IsolationForest. We compare their gene expression to twenty
randomly chosen ‘inlier’ samples. To accomplish this, we need a long
format counting all gene expression values per sample.
# possible outliers detected by IsolationForest
possible_outliers <- c(
'sepcol002',
'sepcol007',
'sepcol014',
'sepcol061',
'sepcol065',
'sepwes014',
'sepcv010T0')
inliner_samples <- rld %>% as.data.frame() %>%
  select(!all_of(possible_outliers))
# sample 20 inliers
names_samples <- sample(names(inliner_samples), 20)
sample_df <- inliner_samples %>%
  select(all_of(names_samples)) %>%
  rownames_to_column('gene')
outlier_samples <- rld %>% as.data.frame() %>%
  select(all_of(possible_outliers)) %>%
  rownames_to_column('gene')
# switch to long format; inliers get "red", outliers get "blue"
combined_long <- outlier_samples %>%
  right_join(sample_df, by = 'gene') %>%
  pivot_longer(-gene, names_to = "sample", values_to = "gene_exp") %>%
  mutate(color = ifelse(sample %in% names_samples, "red", "blue"))
In the next plot, boxplots distinguish between outlier and inlier expression patterns. The outliers detected by IsolationForest do not differ much from the inliers. We will remove them and inspect the results in PCA.
ggplot(combined_long, aes(x = sample, y = gene_exp, fill = color)) +
  geom_boxplot() +
  labs(title = "Boxplot of outliers and a random set of inliers",
       x = "Sample", y = "Gene expression") +
  scale_fill_manual(values = c("red" = "red", "blue" = "blue"),
                    labels = c("red" = "Inlier", "blue" = "Outlier")) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  guides(fill = guide_legend(title = "Type"))
Through outlier detection with IsolationForest, we compare the detected outliers (in blue) to twenty random samples from the inlier population (in red). Only two of these outlier samples show a large distinction from the inliers.
In the next section, we remove the samples in
possible_outliers and compare the results to the previous
PCA plots to measure the outliers’ influence. We conclude that
removing the outliers would not “solve” many clustering problems;
therefore, we will not remove them here. We will look at outliers once
more in the mitochondria-related gene section.
pca_res_inliners <- perform_pca(inliner_samples)
red_meta <- meta.sepsis %>%
filter(!sample_identifier %in% possible_outliers)
plot_pca(pca_res_inliners, meta = red_meta, color = 'sepsis_severity')
PCA on the inlier population, colored by sepsis_severity, to
measure the influence of the outliers detected by IsolationForest.
We use the PC loadings to determine which genes contribute most to
the first two PCs. We accomplish this with the function
top_loadings. This function first decides what target to
use: if it is a gene, we look at the loadings established in PCA;
otherwise, it is a sample, and we use the PC scores (through
$x). We then select the first two PCs and determine which
genes or samples have the highest absolute score for each PC. We slice
five from each, resulting in up to 10 interesting genes or samples.
Thereafter, we plot the reduced PCA dataset with
fviz_pca_var from the package factoextra. The
length of an arrow depicts the ‘importance’ of the gene to the
variance of that PC, while the direction says something about how
genes correlate with each other (and whether the correlation is
negative or positive).
top_loadings <- function(pca_df, target) {
  if (target == 'gene') {
    # if target is gene, pick the loadings
    pc_loadings <- as.data.frame(pca_df$pca$rotation) %>%
      rownames_to_column(target)
  } else {
    # else go for x, the PC scores
    pc_loadings <- as.data.frame(pca_df$pca$x) %>%
      rownames_to_column(target)
  }
  top_performers <- pc_loadings %>%
    select(!!sym(target), PC1, PC2) %>%
    # to long format
    pivot_longer(-!!sym(target), names_to = 'PC', values_to = 'score') %>%
    group_by(PC) %>%
    mutate(absolute_score = abs(score)) %>%
    arrange(desc(absolute_score)) %>%
    # slice the best-performing for each PC
    slice_max(order_by = absolute_score, n = 5) %>%
    # only keep unique genes (can overlap between PCs)
    distinct(!!sym(target))
  return(top_performers)
}
top_genes <- top_loadings(pca_res_batch, 'gene')
fviz_pca_var(pca_res_batch$pca, col.var="contrib",
             gradient.cols = c("blue", "orange", "red"),
             repel = TRUE, select.var = list(name=top_genes$gene))
Top-contributing genes’ loadings in the first PCA components, contributing most to the variance.
We extract the top-contributing genes from rld
(the VST-normalized counts) and depict their expression in a heatmap.
Some genes have high expression rates (>2), especially gene
ENSG000001609132. We do not yet know their gene names or
whether they are mitochondria-related; we have saved the result to a
CSV and will check later whether these are mitochondria-related genes.
top_scores <- rld %>%
  as.data.frame() %>%
  rownames_to_column('gene') %>%
  filter(gene %in% top_genes$gene) %>%
  column_to_rownames('gene') %>%
  scale()
pheatmap(top_scores, show_colnames = F,
main = "Heatmap of the top-contributing genes", scale='column')
Heatmap of top-contributing genes, five for each of the first two components.
We do a similar process for samples, now extracting the samples that
contribute most to the first two PCs based on variance. We took the
top five for each of these PCs. Most samples come from the Toronto
cohort, recognizable by the col notation.
top_score <- top_loadings(pca_res, target = 'id')
pheatmap(rld[, top_score$id], show_rownames = F, show_colnames = F,
         main = "Heatmap of the top-contributing samples")
We now turn to hierarchical clustering, exploring whether there is a
way to cluster our gene expression. We perform a simple hierarchical
clustering based on a reasonably standard workflow: the distance
method is euclidean, and the linkage method
ward.D2. After that, we cut the tree into three clusters
(motivated by the three levels of severity in
sepsis_severity). The dendrogram depicted below has
clusters of roughly equal size.
library(dendextend) # for color_branches()
dist_matrix <- dist(pca_res_batch$pca$x[, 1:10], method = 'euclidean')
hc <- hclust(dist_matrix, method = 'ward.D2')
cluster_labels <- cutree(hc, 3)
dend_colored <- color_branches(as.dendrogram(hc), k = 3,
                               labels = cluster_labels)
plot(dend_colored, hang = -1, horiz = TRUE, leaflab = "none")
Hierarchical clustering on the first ten principal components to establish which samples are closely related. Due to the large number of samples, it is hard to observe relationships here; therefore, other figures that paint a clearer picture follow.
We now associate the established clusters in the previous graph with
sepsis_severity by joining them based on the sample
identifier. Thereafter, we calculate and visualize the findings through
PCA. We see that the severity variable does not cluster well.
cluster_groups <- data.frame(PC1 = pca_res_batch$pca$x[, 1],
                             PC2 = pca_res_batch$pca$x[, 2],
                             var = pca_res_batch$eigenvalues$pva,
                             cluster = factor(cluster_labels)) %>%
  rownames_to_column('sample_identifier') %>%
  left_join(meta.sepsis, by = 'sample_identifier')
plot_pca(cluster_groups, 'PC1', 'PC2',
         color = 'sepsis_severity', shape = 'cluster')
We plot the new grouping based on the results from hierarchical clustering to see whether the clusters overlap with severity. Severity does not align with the clusters established by hierarchical clustering.
To conclude our findings, we plot the two PCs as density plots in the figures below. The clusters are roughly equal in size for both PCs.
cluster_long <- cluster_groups %>%
  pivot_longer(cols = c(PC1, PC2))
ggplot(cluster_long, aes(x = value, fill = cluster)) +
  geom_density(alpha = 0.7) +
  facet_wrap(~ name) +
  labs(title = "Density plots of PC1 and PC2 by cluster")
Density plots of PC1 and PC2 and where the clusters are located.
In conclusion, we looked at correlations, distributions, and missing
values in the metadata set. Due to the cohort dependency of these NAs,
it was hard to impute them, and according to missing-value expert
Erler, filling in a zero is also not the way to go: it distorts
statistics and misrepresents the true data. In addition, we removed
near-zero-variance features from the metadata set. The large number of
missing values in mortality, especially in the ICU cohort,
can become problematic for downstream analysis.
We did not remove any samples, although there are outliers in our
count data; the evidence for removing them is weak, since we do not
see any formation of clusters on the class variables. We also
concluded that VST normalization and batch correction were necessary,
which produced a better distribution for the entire transcriptome and
the mitochondria-related genes. Finally, we discovered that the
sepsis_severity class variable does not cluster well. This
is not a strange phenomenon in RNA-Seq and confirms the challenges
posed by the heterogeneous nature of sepsis.
Baghela, A. et al. “Predicting sepsis severity at first clinical presentation: The role of endotypes and mechanistic signatures”. eBioMedicine vol. 75, 103776, January 2022. DOI: https://doi.org/10.1016/j.ebiom.2021.103776
Erler, N.: https://www.nerler.com/
Ensembl: https://www.ensembl.org/index.html
MSigDB, GOCC_MITOCHONDRION gene set: https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/GOCC_MITOCHONDRION